
    An event-driven probabilistic model of sound source localization using cochlea spikes

    This work presents a probabilistic model that estimates the location of sound sources using the output spikes of a silicon cochlea such as the Dynamic Audio Sensor. Unlike previous work, which estimated source locations directly from the interaural time differences (ITDs) extracted from the timing of the cochlea spikes, the spikes here are used to support a distribution model over the ITDs representing possible locations of sound sources. Results on noisy single-speaker recordings show average accuracies of approximately 80% in detecting the correct source locations and an estimation lag of less than 100 ms.
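
    The sketch below illustrates one way such an event-driven ITD model could be updated spike by spike; the bin range, Gaussian kernel width, and forgetting factor are illustrative assumptions rather than the paper's actual parameters.

```python
import numpy as np

# Hypothetical sketch: keep a belief distribution over discretized ITD bins and
# update it event-by-event from binaural spike-time pairs. Bin range, kernel
# width, and forgetting factor are illustrative assumptions, not the paper's.

ITD_BINS = np.linspace(-800e-6, 800e-6, 33)   # candidate ITDs in seconds
SIGMA = 50e-6                                 # likelihood kernel width (s)
DECAY = 0.98                                  # mixes old belief with uniform

class ITDLocalizer:
    def __init__(self):
        # start from a uniform belief over the candidate ITDs
        self.belief = np.full(ITD_BINS.size, 1.0 / ITD_BINS.size)

    def update(self, t_left, t_right):
        """Fold one left/right spike-time pair into the ITD belief."""
        itd = t_left - t_right
        likelihood = np.exp(-0.5 * ((itd - ITD_BINS) / SIGMA) ** 2)
        # forget a little of the old evidence, then apply Bayes' rule
        prior = DECAY * self.belief + (1.0 - DECAY) / ITD_BINS.size
        self.belief = prior * likelihood
        self.belief /= self.belief.sum()
        return ITD_BINS[np.argmax(self.belief)]   # current best ITD estimate
```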

    Optimal Sampling of Parametric Families: Implications for Machine Learning

    It is well known in machine learning that models trained on a training set generated by one probability distribution function perform far worse on test sets generated by a different probability distribution function. In the limit, a continuum of probability distribution functions might have generated the observed test set data; a desirable property of a learned model in that case is its ability to describe most of the probability distribution functions from the continuum equally well. This requirement naturally leads to sampling methods over the continuum of probability distribution functions that construct optimal training sets. We study the sequential prediction of Ornstein-Uhlenbeck processes that form a parametric family. We find empirically that a simple deep network trained on optimally constructed training sets, using the methods described in this letter, can be robust to changes in the test set distribution.
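
    As a rough illustration of the sampling idea, the sketch below builds a training set of Ornstein-Uhlenbeck sequences whose parameters are drawn across the family rather than fixed at a single value; the parameter ranges and simulation settings are assumptions, not the letter's optimal construction.

```python
import numpy as np

# Illustrative sketch (not the paper's exact procedure): sample parameters
# across the OU family so the learned predictor sees the whole continuum of
# generating distributions rather than one. Ranges below are assumptions.

def simulate_ou(theta, mu, sigma, n_steps=200, dt=0.01, x0=0.0, rng=None):
    """Euler-Maruyama simulation of dx = theta*(mu - x)*dt + sigma*dW."""
    rng = rng or np.random.default_rng()
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        x[t] = (x[t-1] + theta * (mu - x[t-1]) * dt
                + sigma * np.sqrt(dt) * rng.standard_normal())
    return x

def build_training_set(n_sequences=1000, rng=None):
    """Stack sequences whose (theta, sigma) are spread over assumed ranges."""
    rng = rng or np.random.default_rng(0)
    sequences = []
    for _ in range(n_sequences):
        theta = rng.uniform(0.5, 5.0)    # mean-reversion rate (assumed range)
        sigma = rng.uniform(0.1, 1.0)    # diffusion coefficient (assumed range)
        sequences.append(simulate_ou(theta, mu=0.0, sigma=sigma, rng=rng))
    return np.stack(sequences)
```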

    Attention-driven Multi-sensor Selection

    Recent encoder-decoder models for sequence-to-sequence mapping show that integrating temporal and spatial attention mechanisms into neural networks considerably improves network performance. The use of attention for sensor selection in multi-sensor setups, and the benefit of such an attention mechanism, are less well studied. This work reports on a sensor transformation attention network (STAN) that embeds a sensory attention mechanism to dynamically weigh and combine individual input sensors based on their task-relevant information. We demonstrate the correlation of the attentional signal with the changing noise levels of each sensor on the audio-visual GRID dataset with synthetic noise, and on CHiME-4, a multi-microphone real-world noisy dataset. In addition, we demonstrate that the STAN model can handle sensor removal and addition without retraining, and is invariant to channel order. Compared to a two-sensor model that weighs both sensors equally, the equivalent STAN model has a relative parameter increase of only 0.09%, but reduces the relative character error rate (CER) by up to 19.1% on the CHiME-4 dataset. The attentional signal helps to identify the lower-SNR sensor with up to 94.2% accuracy.
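
    A minimal sketch of sensor-level attention in the spirit of a STAN is shown below; the shared scoring network, layer sizes, and tensor layout are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of sensor-level attention: each sensor's feature stream is
# scored by a small shared network, scores are softmax-normalized across
# sensors, and the streams are merged as a weighted sum. All sizes are
# illustrative assumptions, not the paper's configuration.

class SensorAttentionMerge(nn.Module):
    def __init__(self, feat_dim, hidden=32):
        super().__init__()
        # a shared scorer applied to every sensor lets sensors be added or
        # removed at test time without retraining
        self.scorer = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.Tanh(), nn.Linear(hidden, 1)
        )

    def forward(self, sensor_feats):
        # sensor_feats: (batch, n_sensors, time, feat_dim)
        scores = self.scorer(sensor_feats).squeeze(-1)          # (B, S, T)
        weights = torch.softmax(scores, dim=1)                  # normalize over sensors
        merged = (weights.unsqueeze(-1) * sensor_feats).sum(1)  # (B, T, feat_dim)
        return merged, weights
```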

    Event-driven Pipeline for Low-latency Low-compute Keyword Spotting and Speaker Verification System

    This work presents an event-driven acoustic sensor processing pipeline to power a low-resource voice-activated smart assistant. The pipeline includes four major steps, namely localization, source separation, keyword spotting (KWS), and speaker verification (SV). The pipeline is driven by a front-end binaural spiking silicon cochlea sensor. The timing information carried by the output spikes of the cochlea provides spatial cues for localization and source separation. Spike features are generated with low latency from the separated source spikes and are used by both KWS and SV, which rely on state-of-the-art deep recurrent neural network architectures with a small memory footprint. Evaluation on a self-recorded event dataset based on TIDIGITS shows accuracies of over 93% and 88% on KWS and SV respectively, with a minimum system latency of 5 ms on a resource-limited device.
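
    The sketch below strings together simplified stand-ins for the four pipeline stages; every function, threshold, and shape in it is a hypothetical placeholder rather than the paper's implementation.

```python
import numpy as np

# Hypothetical end-to-end sketch of the four stages named in the abstract.
# All functions are simplified stand-ins; thresholds and shapes are assumptions.

def localize_from_itd(left_times, right_times):
    """Crude direction estimate: median ITD over paired spike times."""
    n = min(len(left_times), len(right_times))
    return float(np.median(np.asarray(left_times[:n]) - np.asarray(right_times[:n])))

def separate_by_direction(left_times, right_times, itd, tol=100e-6):
    """Keep left-ear spikes whose nearest right-ear spike matches the target ITD."""
    right = np.asarray(right_times)
    keep = [t for t in left_times if np.min(np.abs((t - right) - itd)) < tol]
    return np.asarray(keep)

def spike_features(spike_times, frame=10e-3, n_frames=50):
    """Spike-count frames: one feature value per 10 ms window."""
    counts, _ = np.histogram(spike_times, bins=n_frames, range=(0, frame * n_frames))
    return counts.astype(np.float32)

def run_pipeline(left_times, right_times, kws_model, sv_model):
    # localization -> separation -> features -> KWS and SV
    itd = localize_from_itd(left_times, right_times)
    source = separate_by_direction(left_times, right_times, itd)
    feats = spike_features(source)
    return kws_model(feats), sv_model(feats)   # keyword label, speaker accept/reject
```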

    Real-Time Speech Recognition for IoT Purpose using a Delta Recurrent Neural Network Accelerator

    This paper describes a continuous speech recognition hardware system that uses a delta recurrent neural network accelerator (DeltaRNN) implemented on a Xilinx Zynq-7100 FPGA to enable low-latency recurrent neural network (RNN) computation. The implemented network consists of a single-layer RNN with 256 gated recurrent unit (GRU) neurons and is driven by input features generated either from the output of a filter bank running on the ARM core of the FPGA in a PmodMic3 microphone setup, or from the asynchronous outputs of a spiking silicon cochlea circuit. The microphone setup achieves 7.1 ms minimum latency and 177 frames-per-second (FPS) maximum throughput, while the cochlea setup achieves 2.9 ms minimum latency and 345 FPS maximum throughput. The low latency and 70 mW power consumption of the DeltaRNN make it suitable as an IoT computing platform.
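
    The sketch below illustrates the delta-update idea that gives a delta network its compute savings, shown for a single layer; the threshold and tanh output are illustrative simplifications, not the accelerator's actual datapath.

```python
import numpy as np

# Sketch of the delta-update idea (assumed simplification): only input
# components whose change since the last processed value exceeds a threshold
# trigger a matrix-vector contribution; small changes are skipped, which is
# where the latency and compute savings come from. Threshold is illustrative.

class DeltaLayer:
    def __init__(self, W, threshold=0.05):
        self.W = W                        # (out_dim, in_dim) weight matrix
        self.threshold = threshold
        self.x_prev = np.zeros(W.shape[1])  # last value used per input
        self.acc = np.zeros(W.shape[0])      # running pre-activation accumulator

    def step(self, x):
        delta = x - self.x_prev
        active = np.abs(delta) > self.threshold          # inputs worth updating
        # accumulate only the contributions of significantly changed inputs
        self.acc += self.W[:, active] @ delta[active]
        # remember the value actually used for each updated input
        self.x_prev[active] = x[active]
        return np.tanh(self.acc)                          # layer output
```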

    Live Demonstration: Real-Time Spoken Digit Recognition using the DeltaRNN Accelerator

    This demonstration shows a real-time continuous speech recognition hardware system using our previously published DeltaRNN accelerator that enables low-latency recurrent neural network (RNN) computation. The network is trained on augmented audio samples from the TIDIGITS dataset to achieve a label error rate (LER) of 2.31%. It is implemented on a Xilinx Zynq-7100 FPGA running at 1 MHz. The incremental RNN power consumption is 30 mW. Visitors interact with the system by speaking digits into a microphone connected to the FPGA system, and the classification outputs of the network are continuously displayed on a laptop screen in real time.

    Feature Representations for Neuromorphic Audio Spike Streams

    Event-driven neuromorphic spiking sensors such as the silicon retina and the silicon cochlea encode external sensory stimuli as asynchronous streams of spikes across different channels or pixels. Combining state-of-the-art deep neural networks with the asynchronous outputs of these sensors has produced encouraging results on some datasets but remains challenging. While the lack of effective spiking networks to process the spike streams is one reason, the other is that the pre-processing methods required to convert the spike streams into the frame-based features needed by the deep networks still require further investigation. This work investigates the effectiveness of synchronous and asynchronous frame-based features generated using spike-count and constant event binning, in combination with a recurrent neural network, for solving a classification task on the N-TIDIGITS18 dataset. This spike-based dataset consists of recordings from the Dynamic Audio Sensor, a spiking silicon cochlea sensor, in response to the TIDIGITS audio dataset. We also propose a new pre-processing method that applies an exponential kernel to the output cochlea spikes so that the inter-spike timing information is better preserved. Results on the N-TIDIGITS18 dataset show that the exponential features perform better than the spike-count features, with over 91% accuracy on the digit classification task. This accuracy corresponds to an improvement of at least 2.5% over the spike-count features, establishing a new state of the art for this dataset.
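
    The sketch below shows one way an exponential kernel can turn a cochlea spike stream into frames while keeping inter-spike timing; the frame length, time constant, and function signature are illustrative assumptions, not the paper's exact feature definition.

```python
import numpy as np

# Illustrative exponential-kernel framing (parameters are assumptions): each
# cochlea channel keeps a trace that decays exponentially between spikes and is
# incremented on every spike, so inter-spike timing survives the conversion to
# fixed-rate frames; the trace is sampled at each frame boundary.

def exponential_features(spike_times, spike_channels, n_channels,
                         frame_dt=5e-3, n_frames=200, tau=20e-3):
    trace = np.zeros(n_channels)          # per-channel decaying trace
    last_t = np.zeros(n_channels)         # time of last trace update
    feats = np.zeros((n_frames, n_channels), dtype=np.float32)
    events = iter(sorted(zip(spike_times, spike_channels)))
    ev = next(events, None)
    for f in range(n_frames):
        t_frame = (f + 1) * frame_dt
        # consume all spikes that fall inside this frame
        while ev is not None and ev[0] <= t_frame:
            t, ch = ev
            trace[ch] = trace[ch] * np.exp(-(t - last_t[ch]) / tau) + 1.0
            last_t[ch] = t
            ev = next(events, None)
        # sample each channel's trace, decayed to the frame boundary
        feats[f] = trace * np.exp(-(t_frame - last_t) / tau)
    return feats
```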